Module 1 - Lab 1

From Raw Text to NLP Pipelines (SEC 10-K)

Hands-on lab activity: Interacting with Textual Data in Jupyter and Colab.
Published: November 21, 2024

Modified: February 17, 2026

Lab Objective

In this lab, you will:

  • Connect Google Colab to VS Code
  • Load real-world corporate text data (SEC 10-K filings)
  • Implement a classical NLP preprocessing pipeline
  • Answer exploratory questions about corporate disclosures using text analytics

This lab establishes the computational and conceptual foundation for later work with embeddings and generative models.

Background Context

Public companies file Form 10-K annually with the U.S. Securities and Exchange Commission (SEC).
These filings contain rich textual information about:

  • business operations
  • risks and uncertainties
  • management discussion
  • regulatory disclosures

In this lab, we treat each 10-K as raw text data and apply a standard NLP pipeline to prepare it for analysis.

Connecting Google Colab to VS Code

You will use Google Colab as the execution backend while working inside VS Code.

Install the Colab Extension

Install the Google Colab extension in VS Code from one of the following:

  • Visual Studio Marketplace
  • Open VSX Registry

Search for “Colab” and install the official extension.

Open or Create a Notebook

In VS Code:

  • Open an existing .ipynb file
    or
  • Create a new Jupyter Notebook

Sign In to Google

When prompted:

  • Sign in using your Google account
  • Authorize Colab access

Select the Colab Kernel

In the notebook interface:

  • Click Select Kernel
  • Choose Colab
  • Select New Colab Server

Your notebook is now running on Google Colab.

Dataset Overview

  • All data for this lab is located in: SEC-10K-2024/
  • You will need to copy the folder to your own Google Drive.
  • Right-click the folder, then click “Add shortcut to Drive”. This lets you access the folder from your Drive.
  • This folder contains plain-text 10-K filings for multiple publicly traded firms.
  • Each file represents one company’s annual report.

Note

When running notebooks on Google Colab, file paths such as ../data/... will not work. Colab runs on a remote virtual machine, so all data must be accessed via mounted storage or downloads.
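One way to make the same notebook run in both environments is to probe a short list of candidate locations. A minimal sketch, assuming the paths below (adjust them to wherever you placed the Drive shortcut; they are examples, not guarantees):

```python
from pathlib import Path

# Candidate dataset locations; these paths are illustrative.
CANDIDATES = [
    Path("/content/drive/MyDrive/SEC-10K-2024"),  # Colab, after drive.mount
    Path("../data/SEC-10K-2024"),                 # local Jupyter / VS Code
]

def resolve_data_dir(candidates):
    """Return the first candidate that exists as a directory, else None."""
    for p in candidates:
        if p.is_dir():
            return p
    return None

data_dir = resolve_data_dir(CANDIDATES)
print(data_dir)
```

If `resolve_data_dir` returns `None`, neither location exists in the current environment, which usually means Drive has not been mounted yet.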

Research Framing (Important)

You are not training a model yet. Instead, think of this lab as asking structured questions of text, such as:

  • What terms dominate risk disclosures?
  • How consistent is language across companies?
  • Which words survive aggressive cleaning?
  • How does preprocessing change the text representation?

Your answers will be supported by intermediate outputs, not final predictions.

NLP Processing Pipeline

You will implement the following pipeline step by step:

  1. Raw text
  2. Sentence segmentation
  3. Tokenization
  4. Part-of-Speech (POS) tagging
  5. Stop-word removal
  6. Stemming / Lemmatization
  7. Dependency parsing
  8. String metrics & matching

(Pipeline flowchart: Raw Text → Sentence Segmentation → Tokenization → Part-of-Speech Tagging → Stop-Word Removal → Stemming / Lemmatization → Dependency Parsing → String Metrics & Matching)

Each stage produces artifacts that help you answer analytical questions.
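The stages above can be sketched end to end as a minimal, standard-library-only pipeline. This is a toy sketch: the lab itself uses NLTK and spaCy, and the tiny stop-word list here is purely illustrative:

```python
import re

# Toy stop-word list for illustration only; NLTK's list is much larger.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "our", "we", "is"}

def simple_pipeline(text):
    """Minimal sketch: segment -> tokenize -> lowercase -> drop stop words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokens = [w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    return [t for t in tokens if t not in STOP_WORDS]

sample = "We face risks. Our results depend on the market."
print(simple_pipeline(sample))
```

Each list comprehension corresponds to one stage of the diagram; the later sections replace each naive step with a proper NLP tool.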

Load and Inspect the Data

Mount Drive in Colab

from google.colab import drive
drive.mount('/content/drive')

Verify Files

import os
os.listdir("/content/drive/MyDrive")

Adjust paths as needed.

For Local Drive (Optional)

from pathlib import Path

DATA_DIR = Path("../data/SEC-10K-2024")

files = list(DATA_DIR.glob("*.txt"))
print(f"Number of 10-K documents: {len(files)}")

# Read a sample document
sample_text = files[0].read_text(encoding="utf-8")
print(sample_text[:1500])
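One hedged way to survey a filing’s structure is to scan for “Item <n>.” headings, a common 10-K labeling convention. A sketch (the `opening` string below is invented sample text, not an actual filing):

```python
import re

# Invented sample text imitating 10-K section headings.
opening = """ITEM 1. BUSINESS
We design products.
ITEM 1A. RISK FACTORS
Our business faces risks.
ITEM 7. MANAGEMENT'S DISCUSSION AND ANALYSIS
Revenue grew."""

# 10-K sections are conventionally labeled "Item <number><letter>."
headings = re.findall(r"(?im)^item\s+\d+[a-z]?\.\s*(.+)$", opening)
print(headings)
```

Running the same pattern over `sample_text[:1500]` gives a quick map of which sections appear in the opening text.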

1 What sections of the 10-K appear most frequently in the opening text?

This will help you understand the structure of the document and identify key areas for analysis (e.g., risk factors, management discussion). We begin with sentence segmentation.

import nltk
nltk.download("punkt")

from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(sample_text)
print(f"Number of sentences: {len(sentences)}")
sentences[:5]

2 Are sentences in 10-Ks longer or shorter than typical news or social media text?
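To reason about this question quantitatively, compare average tokens per sentence across text types. A toy sketch with invented example sentences and naive whitespace tokenization (on real data you would use the `sentences` list from `sent_tokenize` above):

```python
# Invented examples imitating filing prose vs. short social-media text.
filing_like = [
    "The Company is subject to various risks and uncertainties that could materially affect its results of operations.",
    "Our indebtedness could limit our flexibility in planning for changes in our business and the industry.",
]
tweet_like = ["Earnings out today.", "Big miss on revenue."]

def avg_sentence_length(sentences):
    """Mean number of whitespace-separated tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)

print(avg_sentence_length(filing_like))
print(avg_sentence_length(tweet_like))
```

Applying `avg_sentence_length(sentences)` to your segmented 10-K gives the figure to compare against other corpora.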

from nltk.tokenize import word_tokenize

tokens = word_tokenize(sample_text)
tokens[:30]

5 Which important business terms survive stop-word removal?

from nltk.corpus import stopwords
nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
# Keep alphabetic tokens that are not stop words
filtered_tokens = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
filtered_tokens[:30]

from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stems = [stemmer.stem(t) for t in filtered_tokens[:20]]
lemmas = [lemmatizer.lemmatize(t) for t in filtered_tokens[:20]]

list(zip(filtered_tokens[:20], stems, lemmas))

6 Which transformation preserves interpretability better for financial text?

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp(sentences[0])
[(token.text, token.dep_, token.head.text) for token in doc]

7 How might dependency relationships help identify risk statements or obligations?

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarity(
    "risk management strategy",
    "enterprise risk management"
)

8 Why might approximate string matching be useful for cross-company comparison?
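One way approximate matching helps cross-company comparison is aligning similar phrases that are worded slightly differently in each filing. A sketch using the same `SequenceMatcher` as above (the phrases and the 0.6 threshold are invented for illustration):

```python
from difflib import SequenceMatcher

def best_match(phrase, candidates, threshold=0.6):
    """Return the candidate most similar to `phrase`, or None if all fall below threshold."""
    scored = [(SequenceMatcher(None, phrase, c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

company_a_phrase = "enterprise risk management framework"
company_b_phrases = [
    "risk management program",
    "supply chain disruptions",
    "cybersecurity incidents",
]
print(best_match(company_a_phrase, company_b_phrases))
```

Exact string equality would miss the alignment entirely; the ratio-based match tolerates rewording while still rejecting unrelated phrases.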

9 Deliverables

Submit a Word document answering the questions, along with your Jupyter notebook containing the code and outputs (either .ipynb or .pdf).

10 Key Takeaway

Before we can generate language, we must first discipline text into structure.

This pipeline is the foundation upon which Bag of Words, TF-IDF, embeddings, and generative models are built.